Pandas

Pandas is a high-performance python library that provides a comprehensive set of data structures for manipulating tabular data, providing high-performance indexing, automatic alignment, reshaping, grouping, joining and statistical analyses capabilities.

The two primary data structures in pandas are the Series and the DataFrame objects.

Series Object

The Series object is the fundamental building block of pandas. A Series represents an one-dimensional array based on the NumPy ndarray but with a labeled index that significantly helps to access the elements.

A Series always has an index even if one is not specified, by default pandas will create an index that consists of sequential integers starting from zero. Access to elements is not by integer position but using values in the index referred as Labels.

Importing pandas into the application is simple. It is common to import both pandas and numpy with their objects mapped into the pd and np namespaces respectively.


In [1]:
import numpy as np
import pandas as pd
pd.__version__


Out[1]:
'0.20.1'

In [2]:
np.__version__


Out[2]:
'1.12.1'

In [3]:
# set some options to control output display
pd.set_option('display.notebook_repr_html',False)
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',10)

Creating Series

A Series can be created and initialised by passing either a scalar value, a NumPy nd array, a Python list or a Python Dict as the data parameter of the Series constructor.


In [4]:
# create one item series
s1 = pd.Series(1)
s1


Out[4]:
0    1
dtype: int64

'0' is the index and '1' is the value. The data type (dtype) is also shown. We can also retrieve the value using the associated index.


In [5]:
# get value with label 0
s1[0]


Out[5]:
1

In [6]:
# create from list
s2 = pd.Series([1,2,3,4,5])
s2


Out[6]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

In [8]:
# get the values in the series
s2.values


Out[8]:
array([1, 2, 3, 4, 5])

In [9]:
# get the index of the series
s2.index


Out[9]:
RangeIndex(start=0, stop=5, step=1)

Creating Series with named index

Pandas will create different index types based on the type of data identified in the index parameter. These different index types are optimized to perform indexing operations for that specific data type. To specify the index at the time of creation of the Series, use the index parameter of the constructor.


In [13]:
# explicitly create an index
# index is alpha, not an integer
s3 = pd.Series([1,2,3], index=['a','b','c'])
s3


Out[13]:
a    1
b    2
c    3
dtype: int64

In [14]:
s3.index


Out[14]:
Index(['a', 'b', 'c'], dtype='object')

Please note the type of the index items. It is not string but 'object'.


In [15]:
# look up by label value and not object position
s3['b']


Out[15]:
2

In [18]:
# position also works
s3[2]


Out[18]:
3

In [19]:
# create series from an existing index
# scalar value will be copied at each index label
s4 = pd.Series(2,index=s2.index)
s4


Out[19]:
0    2
1    2
2    2
3    2
4    2
dtype: int64

It is a common practice to initialize the Series objects using NumPy ndarrays, and with various NumPy functions that create arrays. The following code creates a Series from five normally distributed values:


In [20]:
np.random.seed(123456)
pd.Series(np.random.randn(5))


Out[20]:
0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4    1.212112
dtype: float64

In [22]:
# 0 through 9
pd.Series(np.linspace(0,9,10))


Out[22]:
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
dtype: float64

In [23]:
# o through 8
pd.Series(np.arange(0,9))


Out[23]:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
dtype: int64

A Series can also be created from a Python dictionary. The keys of the dictionary are used as the index lables for the Series:


In [25]:
s6 = pd.Series({'a':1,'b':2,'c':3,'d':4})
s6


Out[25]:
a    1
b    2
c    3
d    4
dtype: int64

Size, Shape, Count and Uniqueness of Values


In [26]:
# example series which also contains a NaN
s = pd.Series([0,1,1,2,3,4,5,6,7,np.NaN])
s


Out[26]:
0    0.0
1    1.0
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
7    6.0
8    7.0
9    NaN
dtype: float64

In [27]:
# length of the Series
len(s)


Out[27]:
10

In [28]:
s.size


Out[28]:
10

In [29]:
# shape is a tuple with one value
s.shape


Out[29]:
(10,)

In [30]:
# number of values not part of NaN can be found using count() method
s.count()


Out[30]:
9

In [31]:
# all unique values
s.unique()


Out[31]:
array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,  nan])

In [32]:
# count of non-NaN values, returned max to min order
s.value_counts()


Out[32]:
1.0    2
7.0    1
6.0    1
5.0    1
4.0    1
3.0    1
2.0    1
0.0    1
dtype: int64

Peeking at data with heads, tails and take

pandas provides the .head() and .tail() methods to examine just the first few or last records in a Series. By default, these return the first five or last rows respectively, but you can use the n parameter or just pass an integer to specify the number of rows:


In [33]:
# first five
s.head()


Out[33]:
0    0.0
1    1.0
2    1.0
3    2.0
4    3.0
dtype: float64

In [34]:
# first three
s.head(3)


Out[34]:
0    0.0
1    1.0
2    1.0
dtype: float64

In [35]:
# last five
s.tail()


Out[35]:
5    4.0
6    5.0
7    6.0
8    7.0
9    NaN
dtype: float64

In [36]:
# last 2
s.tail(n=2) # equivalent to s.tail(2)


Out[36]:
8    7.0
9    NaN
dtype: float64

The .take() method will return the rows in a series that correspond to the zero-based positions specified in a list:


In [37]:
# only take specific items
s.take([0,3,9])


Out[37]:
0    0.0
3    2.0
9    NaN
dtype: float64

Looking up values in Series

Values in a Series object can be retrieved using the [] operator and passing either a single index label or a list of index labels.


In [39]:
# single item lookup
s3['a']


Out[39]:
1

In [40]:
# lookup by position since index is not an integer
s3[2]


Out[40]:
3

In [42]:
# multiple items
s3[['a','c']]


Out[42]:
a    1
c    3
dtype: int64

In [43]:
# series with an integer index but not starting with 0
s5 = pd.Series([1,2,3], index =[11,12,13])
s5[12]  # by value as value passed and index are both integer


Out[43]:
2

To alleviate the potential confusion in determining the label-based lookups versus position-based lookups, index based lookup can be enforced using the .loc[] accessor:


In [44]:
# force lookup by index label
s5.loc[12]


Out[44]:
2

Lookup by position can be enforced using the iloc[] accessor:


In [45]:
# force lookup by position or location
s5.iloc[1]


Out[45]:
2

In [46]:
# multiple items by index label
s5.loc[[12,10]]


Out[46]:
12    2.0
10    NaN
dtype: float64

In [47]:
# multiple items by position or location
s5.iloc[[1,2]]


Out[47]:
12    2
13    3
dtype: int64

If a location / position passed to .iloc[] in a list is out of bounds, an exception will be thrown. This is different than with .loc[], which if passed a label that does not exist, will return NaN as the value for that label:


In [48]:
s5.loc[[12,-1,15]]


Out[48]:
 12    2.0
-1     NaN
 15    NaN
dtype: float64

A Series also has a property .ix that can be used to look up items either by label or by zero-based array position.


In [49]:
s3


Out[49]:
a    1
b    2
c    3
dtype: int64

In [50]:
# label based lookup
s3.ix[['a','b']]


/Users/cnc/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:2: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  
Out[50]:
a    1
b    2
dtype: int64

In [51]:
# position based lookup
s3.ix[[1,2]]


Out[51]:
b    2
c    3
dtype: int64

This can become complicated if the indexes are integers and you pass a list of integers to ix. Since they are of the same type, the lookup will be by index label instead of position:


In [52]:
# this looks by label and not position
# note that 1,2 have NaN as those labels do not exist in the index
s5.ix[[1,2,10,11]]


/Users/cnc/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:3: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  This is separate from the ipykernel package so we can avoid doing imports until
Out[52]:
1     NaN
2     NaN
10    NaN
11    1.0
dtype: float64

Alignment via index labels

A fundamental difference between a NumPy ndarray and a pandas Series is the ability of a Series to automatically align data from another Series based on label values before performing an operation.


In [53]:
s6 = pd.Series([1,2,3,4], index=['a','b','c','d'])
s6


Out[53]:
a    1
b    2
c    3
d    4
dtype: int64

In [54]:
s7 = pd.Series([4,3,2,1], index=['d','c','b','a'])
s7


Out[54]:
d    4
c    3
b    2
a    1
dtype: int64

In [55]:
s6 + s7


Out[55]:
a    2
b    4
c    6
d    8
dtype: int64

This is a very different result that what it would have been if it were two pure NumPy arrays being added. A NumPy ndarray would add the items in identical positions of each array resulting in different values.


In [56]:
a1 = np.array([1,2,3,4,5])
a2 = np.array([5,4,3,2,1])
a1 + a2


Out[56]:
array([6, 6, 6, 6, 6])

The process of adding two Series objects differs from the process of addition of arrays as it first aligns data based on index label values instead of simply applying the operation to elements in the same position. This becomes significantly powerful when using pandas Series to combine data based on labels instead of having to first order the data manually.

Arithmetic Operations

Arithemetic Operations

(+,-,*,/)
can be applied either to a Series or between 2 Series objects


In [57]:
# multiply all values in s3 by 2
s3 * 2


Out[57]:
a    2
b    4
c    6
dtype: int64

In [58]:
# scalar series using the s3's index
# not efficient as it will no use vectorisation
t = pd.Series(2,s3.index)
s3 * t


Out[58]:
a    2
b    4
c    6
dtype: int64

To reinforce the point that alignment is being performed when applying arithmetic operations across two Series objects, look at the following two Series as examples:


In [59]:
# we will add this to s9
s8 = pd.Series({'a':1,'b':2,'c':3,'d':5})
s8


Out[59]:
a    1
b    2
c    3
d    5
dtype: int64

In [60]:
s9 = pd.Series({'b':6,'c':7,'d':9,'e':10})
s9


Out[60]:
b     6
c     7
d     9
e    10
dtype: int64

In [61]:
# NaN's result for a and e demonstrates alignment
s8 + s9


Out[61]:
a     NaN
b     8.0
c    10.0
d    14.0
e     NaN
dtype: float64

In [62]:
s10 = pd.Series([1.0,2.0,3.0],index=['a','a','b'])
s10


Out[62]:
a    1.0
a    2.0
b    3.0
dtype: float64

In [63]:
s11 = pd.Series([4.0,5.0,6.0], index=['a','a','c'])
s11


Out[63]:
a    4.0
a    5.0
c    6.0
dtype: float64

In [64]:
# will result in four 'a' index labels
s10 + s11


Out[64]:
a    5.0
a    6.0
a    6.0
a    7.0
b    NaN
c    NaN
dtype: float64

The reason for the above result is that during alignment, pandas actually performs a cartesian product of the sets of all the unique index labels in both Series objects, and then applies the specified operation on all items in the products.

To explain why there are four 'a' index values s10 contains two 'a' labels and s11 also contains two 'a' labels. Every combination of 'a' labels in each will be calculated resulting in four 'a' labels. There is one 'b' label from s10 and one 'c' label from s11. Since there is no matching label for either in the other Series object, they only result in a sing row in the resulting Series object.

Each combination of values for 'a' in both Series are computed, resulting in the four values: 1+4,1+5,2+4 and 2+5.

So remember that an index can have duplicate labels, and during alignment this will result in a number of index labels equivalent to the products of the number of the labels in each Series.

The special case of Not-A-Number (NaN)

pandas mathematical operators and functions handle NaN in a special manner (compared to NumPy ndarray) that does not break the computations. pandas is lenient with missing data assuming that it is a common situation.


In [65]:
nda = np.array([1,2,3,4,5])
nda.mean()


Out[65]:
3.0

In [66]:
# mean of numpy array values with a NaN
nda = np.array([1,2,3,4,np.NaN])
nda.mean()


Out[66]:
nan

In [67]:
# Series object ignores NaN values - does not get factored
s = pd.Series(nda)
s.mean()


Out[67]:
2.5

In [68]:
# handle NaN values like Numpy
s.mean(skipna=False)


Out[68]:
nan

Boolean selection

Items in a Series can be selected, based on the value instead of index labels, via the utilization of a Boolean selection.


In [69]:
# which rows have values that are > 5
s = pd.Series(np.arange(0,10))
s > 5


Out[69]:
0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
8     True
9     True
dtype: bool

In [71]:
# select rows where values are > 5
# overloading the Series object [] operator
logicalResults = s > 5
s[logicalResults]


Out[71]:
6    6
7    7
8    8
9    9
dtype: int64

In [72]:
# a little shorter version
s[s > 5]


Out[72]:
6    6
7    7
8    8
9    9
dtype: int64

In [73]:
# using & operator
s[(s>5)&(s<9)]


Out[73]:
6    6
7    7
8    8
dtype: int64

In [74]:
# using | operator
s[(s > 3) | (s < 5)]


Out[74]:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [75]:
# are all items >= 0?
(s >=0).all()


Out[75]:
True

In [76]:
# are any items < 2
s[s < 2].any()


Out[76]:
True

The result of these logical expressions is a Boolean selection, a Series of True and False values. The .sum() method of a Series, when given a series of Boolean values, will treat True as 1 and False as 0. The following demonstrates using this to determine the number of items in a Series that satisfy a given expression:


In [77]:
(s < 2).sum()


Out[77]:
2

Reindexing a Series

Reindexing in pandas is a process that makes the data in a Series or DataFrame match a given set of labels.

This process of performing a reindex includes the following steps:

  1. Reordering existing data to match a set of labels.
  2. Inserting NaN markers where no data exists for a label.
  3. Possibly, filling missing data for a label using some type of logic

In [78]:
# sample series of five items
s = pd.Series(np.random.randn(5))
s


Out[78]:
0   -0.173215
1    0.119209
2   -1.044236
3   -0.861849
4   -2.104569
dtype: float64

In [79]:
# change the index
s.index = ['a','b','c','d','e']
s


Out[79]:
a   -0.173215
b    0.119209
c   -1.044236
d   -0.861849
e   -2.104569
dtype: float64

In [80]:
# concat copies index values verbatim
# potentially making duplicates
np.random.seed(123456)
s1 = pd.Series(np.random.randn(3))
s2 = pd.Series(np.random.randn(3))
combined = pd.concat([s1,s2])
combined


Out[80]:
0    0.469112
1   -0.282863
2   -1.509059
0   -1.135632
1    1.212112
2   -0.173215
dtype: float64

In [81]:
# reset the index
combined.index = np.arange(0,len(combined))
combined


Out[81]:
0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4    1.212112
5   -0.173215
dtype: float64

Greater flexibility in creating a new index is provided using the .reindex() method. An example of the flexibility of .reindex() over assigning the .index property directly is that the list provided to .reindex() can be of a different length than the number of rows in the Series:


In [82]:
np.random.seed(123456)
s1 = pd.Series(np.random.randn(4),['a','b','c','d'])
# reindex with different number of labels
# results in dropped rows and/or NaN's
s2 = s1.reindex(['a','c','g'])
s2


Out[82]:
a    0.469112
c   -1.509059
g         NaN
dtype: float64

There are several things here that are important to point out about .reindex() method.

  • First is that the result of .reindex() method is a new Series. This new Series has an index with labels that are provided as parameter to reindex().
  • For each item in the given parameter list, if the original Series contains that label, then the value is assigned to that label.
  • If that label does not exist in the original Series, pandas assigns a NaN value.
  • Rows in the Series without a label specified in the parameter of .reindex() is not included in the result.

To demonstrate that the result of .reindex() is a new Series object, changing a value in s2 does not change the values in s1:


In [83]:
# s2 is a different series than s1
s2['a'] = 0
s2


Out[83]:
a    0.000000
c   -1.509059
g         NaN
dtype: float64

In [ ]:
# this did not modify s1
s1

Reindex is also useful when you want to align two Series to perform an operation on matching elements from each series; however, for some reason, the two Series has index labels that will not initially align.


In [87]:
# different types for the same values of labels causes big issue
s1 = pd.Series([0,1,2],index=[0,1,2])
s2 = pd.Series([3,4,5],index=['0','1','2'])
s1 + s2


Out[87]:
0   NaN
1   NaN
2   NaN
0   NaN
1   NaN
2   NaN
dtype: float64

The reason why this happens in pandas are as follows:

  1. pandas first tries to align by the indexes and finds no matches, so it copies the index labels from the first series and tries to append the indexes from the second Series.
  2. However, since they are different type, it defaults back to zero-based integer sequence that results in duplicate values.
  3. Finally, all values are NaN because the operation tries to add the item in the first Series with the integer label 0, which has a value of 0, but can't find the item in the other series and therefore the result in NaN.

In [88]:
# reindex by casting the label types and we will get the desired result
s2.index = s2.index.values.astype(int)
s1 + s2


Out[88]:
0    3
1    5
2    7
dtype: int64

The default action of inserting NaN as a missing value during reindexing can be changed by using the fill_value parameter of the method.


In [89]:
# fill with 0 instead on NaN
s2 = s.copy()
s2.reindex(['a','f'],fill_value=0)


Out[89]:
a   -0.173215
f    0.000000
dtype: float64

When performing a reindex on ordered data such as a time series, it is possible to perform interpolation or filling of values. The following example demonstrates forward filling, often referred to as "last known value".


In [90]:
# create example to demonstrate fills
s3 = pd.Series(['red','green','blue'],index=[0,3,5])
s3


Out[90]:
0      red
3    green
5     blue
dtype: object

In [91]:
# forward fill using ffill method
s3.reindex(np.arange(0,7), method='ffill')


Out[91]:
0      red
1      red
2      red
3    green
4    green
5     blue
6     blue
dtype: object

In [92]:
# backward fill using bfill method
s3.reindex(np.arange(0,7),method='bfill')


Out[92]:
0      red
1    green
2    green
3    green
4     blue
5     blue
6      NaN
dtype: object

Modifying a Series in-place

There are several ways that an existing Series can be modified in-place having either its values changed or having rows added or deleted.

A new item can be added to a Series by assigning a value to an index label that does not already exist.


In [94]:
np.random.seed(123456)
s = pd.Series(np.random.randn(3),index=['a','b','c'])
s


Out[94]:
a    0.469112
b   -0.282863
c   -1.509059
dtype: float64

In [95]:
# change a value in the Series
# this done in-place
# a new Series is not returned that has a modified value
s['d'] = 100
s


Out[95]:
a      0.469112
b     -0.282863
c     -1.509059
d    100.000000
dtype: float64

In [96]:
# value at a specific index label can be changed by assignment:
s['d'] = -100
s


Out[96]:
a      0.469112
b     -0.282863
c     -1.509059
d   -100.000000
dtype: float64

Items can be removed from a Series using the del() function and passing the index label(s) to be removed.


In [97]:
del(s['a'])
s


Out[97]:
b     -0.282863
c     -1.509059
d   -100.000000
dtype: float64

Slicing a Series


In [98]:
# a series to use for slicing
# using index labels not starting at 0 to demonstrate
# position based slicing

s = pd.Series(np.arange(100,110),index=np.arange(10,20))
s


Out[98]:
10    100
11    101
12    102
13    103
14    104
15    105
16    106
17    107
18    108
19    109
dtype: int64

In [99]:
# items at position 0,2,4
s[0:6:2]


Out[99]:
10    100
12    102
14    104
dtype: int64

In [100]:
# equivalent to
s.iloc[[0,2,4]]


Out[100]:
10    100
12    102
14    104
dtype: int64

In [101]:
# first five by slicing, same as .head(5)
s[:5]


Out[101]:
10    100
11    101
12    102
13    103
14    104
dtype: int64

In [102]:
# fourth position to the end
s[4:]


Out[102]:
14    104
15    105
16    106
17    107
18    108
19    109
dtype: int64

In [103]:
# every other item in the first five positions
s[:5:2]


Out[103]:
10    100
12    102
14    104
dtype: int64

In [104]:
# every other item starting at the fourth position
s[4::2]


Out[104]:
14    104
16    106
18    108
dtype: int64

In [105]:
# reverse the series
s[::-1]


Out[105]:
19    109
18    108
17    107
16    106
15    105
14    104
13    103
12    102
11    101
10    100
dtype: int64

In [106]:
# every other starting at position 4, in reverse
s[4::-2]


Out[106]:
14    104
12    102
10    100
dtype: int64

In [107]:
# :-2 which means positions 0 through (10-2) which is [8]
s[:-2]


Out[107]:
10    100
11    101
12    102
13    103
14    104
15    105
16    106
17    107
dtype: int64

In [108]:
# last 3 items
# equivalent to tail(3)
s[-3:]


Out[108]:
17    107
18    108
19    109
dtype: int64

In [109]:
# equivalent to s.tail(4).head(3)
s[-4:-1]


Out[109]:
16    106
17    107
18    108
dtype: int64

An important thing to keep in mind when using slicing, is that the result of the slice is actually a view into the original Series. Modification of values through the result of the slice will modify the original Series.


In [110]:
# preserve s
# slice with first 2 rows
copy = s.copy()
slice = copy[:2]
slice


Out[110]:
10    100
11    101
dtype: int64

Now the assignment of a value to an element of a slice will change the value in the original Series:


In [111]:
slice[11] = 1000
copy


Out[111]:
10     100
11    1000
12     102
13     103
14     104
15     105
16     106
17     107
18     108
19     109
dtype: int64

Slicing can be performed on Series objects with a non-integer index.


In [112]:
# used to demonstrate the next two slices
s = pd.Series(np.arange(0,5),index=['a','b','c','d','e'])
s


Out[112]:
a    0
b    1
c    2
d    3
e    4
dtype: int64

In [113]:
# slicing with integer values will extract items based on position:
s[1:3]


Out[113]:
b    1
c    2
dtype: int64

In [114]:
# with non-integer index, it is also possible to slice with values in the same type of the index:
s['b':'d']


Out[114]:
b    1
c    2
d    3
dtype: int64